(54) Title: WRITE ANYWHERE FILE-SYSTEM LAYOUT
(57) Abstract

The present invention provides a method for keeping a file system in a
consistent state and for creating read-only copies of a file system. Changes
to the file system are tightly controlled. The file system progresses from one
self-consistent state to another. The set of self-consistent blocks on disk
that is rooted by the root inode is referred to as a consistency point. To
implement consistency points, new data is written to unallocated blocks on
disk. A new consistency point occurs when the fsinfo block (2440) is updated
by writing a new root inode for the inode file (1210) into it. Thus, as long
as the root inode is not updated, the state of the file system represented on
disk does not change. The present invention also creates snapshots (Figure 22)
that are read-only copies of the file system. A snapshot uses no disk space
when it is initially created. It is designed so that many different snapshots
can be created for the same file system. Unlike prior art file systems that
create a clone by duplicating the entire inode file and all of the indirect
blocks, the present invention duplicates only the inode that describes the
inode file. A multi-bit free-block map file (1630) is used to prevent data
from being overwritten on disk.
WRITE ANYWHERE FILE-SYSTEM LAYOUT
BACKGROUND OF THE INVENTION

1. FIELD OF THE INVENTION

The present invention is related to the field of methods and apparatus 
for maintaining a consistent file system and for creating read-only copies of the 
file system. 

2. BACKGROUND ART

All file systems must maintain consistency in spite of system failure. A 
number of different consistency techniques have been used in the prior art for 
this purpose.

One of the most difficult and time consuming issues in managing any
file server is making backups of file data. Traditional solutions have been to
copy the data to tape or other off-line media. With some file systems, the file
server must be taken off-line during the backup process in order to ensure that
the backup is completely consistent. A recent advance in backup is the ability
to quickly "clone" (i.e., a prior art method for creating a read-only copy of the
file system on disk) a file system, and perform a backup from the clone instead
of from the active file system. This type of file system allows the file
server to remain on-line during the backup.



File System Consistency



A prior art file system is disclosed by Chutani, et al. in an article entitled
The Episode File System, USENIX, Winter 1992, at pages 43-59. The article
describes the Episode file system which is a file system using meta-data (i.e.,
inode tables, directories, bitmaps, and indirect blocks). It can be used as a
stand-alone or as a distributed file system. Episode supports a plurality of
separate file system hierarchies. Episode refers to the plurality of file
systems collectively as an "aggregate". In particular, Episode provides a clone
of each file system for slowly changing data.

In Episode, each logical file system contains an "anode" table. An anode
table is the equivalent of an inode table used in file systems such as the
Berkeley Fast File System. It is a 252-byte structure. Anodes are used to store
all user data as well as meta-data in the Episode file system. An anode
describes the root directory of a file system including auxiliary files and
directories. Each such file system in Episode is referred to as a "fileset". All
data within a fileset is locatable by iterating through the anode table and
processing each file in turn. Episode creates a read-only copy of a file system,
herein referred to as a "clone", and shares data with the active file system
using Copy-On-Write (COW) techniques.

Episode uses a logging technique to recover a file system(s) after a system
crashes. Logging ensures that the file system meta-data are consistent. A
bitmap table contains information about whether each block in the file system
is allocated or not. Also, the bitmap table indicates whether or not each block
is logged. All meta-data updates are recorded in a log "container" that stores
the transaction log of the aggregate. The log is processed as a circular buffer
of disk blocks. The transaction logging of Episode uses logging techniques
originally developed for databases to ensure file system consistency. This
technique uses carefully ordered writes and a recovery program that are
supplemented by database techniques in the recovery program.

Other prior art systems including JFS of IBM and VxFS of Veritas
Corporation use various forms of transaction logging to speed the recovery
process, but still require a recovery process.

Another prior art method is called the "ordered write" technique. It
writes all disk blocks in a carefully determined order so that damage is
minimized when a system failure occurs while performing a series of related
writes. The prior art attempts to ensure that inconsistencies that occur are
harmless. For instance, a few unused blocks or inodes may be left marked as
allocated. The primary disadvantage of this technique is that the restrictions
it places on disk order make it hard to achieve high performance.

Yet another prior art system is an elaboration of the second prior art
method referred to as an "ordered write with recovery" technique. In this
method, inconsistencies can be potentially harmful. However, the order of
writes is restricted so that inconsistencies can be found and fixed by a recovery
program. Examples of this method include the original UNIX file system and
Berkeley Fast File System (FFS). This technique does not reduce disk ordering
sufficiently to eliminate the performance penalty of disk ordering. Another
disadvantage is that the recovery process is time consuming. It typically is
proportional to the size of the file system. Therefore, for example, recovering
a 5 GB FFS file system requires an hour or more to perform.
File System Clones

Figure 1 is a prior art diagram for the Episode file system illustrating the
use of copy-on-write (COW) techniques for creating a fileset clone. Anode 110
comprises a first pointer 110A having a COW bit that is set. Pointer 110A
references data block 114 directly. Anode 110 comprises a second pointer 110B
having a COW bit that is cleared. Pointer 110B of anode 110 references indirect
block 112. Indirect block 112 comprises a pointer 112A that references data
block 124 directly. The COW bit of pointer 112A is set. Indirect block 112
comprises a second pointer 112B that references data block 126. The COW bit of
pointer 112B is cleared.

A clone anode 120 comprises a first pointer 120A that references data
block 114. The COW bit of pointer 120A is cleared. The second pointer 120B of
clone anode 120 references indirect block 122. The COW bit of pointer 120B is
cleared. In turn, indirect block 122 comprises a pointer 122A that references
data block 124. The COW bit of pointer 122A is cleared.

As illustrated in Figure 1, every direct pointer 110A, 112A-112B, 120A,
and 122A and indirect pointer 110B and 120B in the Episode file system
contains a COW bit. Blocks that have not been modified are contained in both
the active file system and the clone, and have set (1) COW bits. The COW bit is
cleared (0) when a block that is referenced by the pointer has been modified
and, therefore, is part of the active file system but not the clone.



When a copy-on-write block is modified, as shown in Figure 1, a new
block is allocated and updated. The COW flag in the pointer to this new block
is then set. The COW bit of pointer 110A of original anode 110 is cleared.
Thus, when the clone anode 120 is created, pointer 120A of clone anode 120
references data block 114 also. Both original anode 110 and clone anode 120
reference data block 114. Data block 124 has also been modified as indicated by
a cleared COW bit of pointer 112A in original indirect block 112. Thus, when
the clone anode is created, indirect block 122 is created. Pointer 122A of
indirect block 122 references data block 124, and the COW bit of pointer 122A is
cleared. Both indirect block 112 of the original anode 110 and indirect block 122
of clone anode 120 reference data block 124.

Figure 1 illustrates copying of an anode to create a clone anode 120 for a
single file. However, clone anodes must be created for every file having
changed data blocks in the file system. At the time of the clone, all inodes
must be copied. Creating clone anodes for every modified file in the file
system can consume significant amounts of disk space. Further, Episode is not
capable of supporting multiple clones since each pointer has only one COW
bit. A single COW bit is not able to distinguish more than one clone. For
more than one clone, there is not a second COW bit that can be set.

A fileset "clone" is a read-only copy of an active fileset wherein the
active fileset is readable and writable. Clones are implemented using COW
techniques, and share data blocks with an active fileset on a block-by-block
basis. Episode implements cloning by copying each anode stored in a fileset.
When initially cloned, the writable anode of the active fileset and the
cloned anode both point to the same data block(s). However, the disk
addresses for direct and indirect blocks in the original anode are tagged as
COW. Thus, an update to the writable fileset does not affect the clone. When
a COW block is modified, a new block is allocated in the file system and
updated with the modification. The COW flag in the pointer to this new block
is cleared.



The prior art Episode system creates clones that duplicate the entire
inode file and all of the indirect blocks in the file system. Episode duplicates
all inodes and indirect blocks so that it can set a Copy-On-Write (COW) bit in
all pointers to blocks that are used by both the active file system and the
clone. In Episode, it is important to identify these blocks so that new data
written to the active file system does not overwrite "old" data that is part of
the clone and, therefore, must not change.

Creating a clone in the prior art can use up as much as 32 MB on a 1 GB
disk. The prior art uses 256 MB of disk space on a 1 GB disk (for 4 KB blocks) to
keep eight clones of the file system. Thus, the prior art cannot use large
numbers of clones to prevent loss of data. Instead it is used to facilitate
backup of the file system onto an auxiliary storage means other than the disk
drive, such as a tape backup device. Clones are used to backup a file system in
a consistent state at the instant the clone is made. By cloning the file system,
the clone can be backed up to the auxiliary storage means without shutting down
the active file system, and thereby preventing users from using the file system.
Thus, clones allow users to continue accessing an active file system while the
file system, in a consistent state, is backed up. Then the clone is deleted once
the backup is completed. Episode is not capable of supporting multiple clones
since each pointer has only one COW bit. A single COW bit is not able to
distinguish more than one clone. For more than one clone, there is no second
COW bit that can be set.



WO 94/29807

PCT/US94/06320



A disadvantage of the prior art system for creating file system clones is
that it involves duplicating all of the inodes and all of the indirect blocks in
the file system. For a system with many small files, the inodes alone can
consume a significant percentage of the total disk space in a file system. For
example, a 1 GB file system that is filled with 4 KB files has 32 MB of inodes.
Thus, creating an Episode clone consumes a significant amount of disk space,
and generates large amounts (i.e., many megabytes) of disk traffic. As a result
of these conditions, creating a clone of a file system takes a significant amount
of time to complete.
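
The figures quoted above can be checked with a few lines of arithmetic. The
sketch below is illustrative only: it assumes binary units (1 GB = 2^30 bytes)
and that every file occupies exactly one 4 KB block. The per-inode size it
prints is merely what the quoted 32 MB figure implies, not a value stated for
the prior art system.

```python
# Worked arithmetic for the clone overhead quoted in the text.
# Assumptions (not from the source): binary units, one 4 KB block per file.
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

disk_size = 1 * GB
file_size = 4 * KB
inode_bytes_total = 32 * MB      # inode space quoted in the text

num_files = disk_size // file_size                    # files on a full disk
implied_inode_size = inode_bytes_total // num_files   # bytes per inode


def clone_overhead(num_clones: int) -> int:
    """Disk space used if every clone duplicates all of the inodes."""
    return num_clones * inode_bytes_total


print(num_files)                  # 262144
print(implied_inode_size)         # 128
print(clone_overhead(8) // MB)    # 256
```

Eight clones at 32 MB each give the 256 MB quoted earlier for a 1 GB disk.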

Another disadvantage of the prior art system is that it makes it difficult
to create multiple clones of the same file system. The result of this is that
clones tend to be used, one at a time, for short term operations such as backing
up the file system to tape, and are then deleted.
SUMMARY OF THE INVENTION

The present invention provides a method for maintaining a file system
in a consistent state and for creating read-only copies of a file system. Changes
to the file system are tightly controlled to maintain the file system in a
consistent state. The file system progresses from one self-consistent state to
another self-consistent state. The set of self-consistent blocks on disk that is
rooted by the root inode is referred to as a consistency point (CP). To
implement consistency points, WAFL always writes new data to unallocated
blocks on disk. It never overwrites existing data. A new consistency point
occurs when the fsinfo block is updated by writing a new root inode for the
inode file into it. Thus, as long as the root inode is not updated, the state of the
file system represented on disk does not change.
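
The idea of a consistency point can be illustrated with a toy model. The
following is a minimal sketch, not the WAFL implementation: the class
WriteAnywhereDisk, its methods, and the use of a Python dictionary as the
"root" are all hypothetical names chosen for illustration. It shows only that
new data goes to unallocated blocks and that the visible state changes when,
and only when, the root reference is replaced.

```python
class WriteAnywhereDisk:
    """Toy model: blocks are never overwritten; only the root pointer moves."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [None] * num_blocks   # None marks an unallocated block
        self.fsinfo_root = None             # block number of the current root

    def _allocate(self, payload) -> int:
        # New data is always placed in an unallocated block.
        for blkno, content in enumerate(self.blocks):
            if content is None:
                self.blocks[blkno] = payload
                return blkno
        raise RuntimeError("no unallocated blocks left")

    def read_state(self) -> dict:
        """File system state as reachable from the root recorded in fsinfo."""
        if self.fsinfo_root is None:
            return {}
        root = self.blocks[self.fsinfo_root]
        return {name: self.blocks[blkno] for name, blkno in root.items()}

    def stage(self, changes: dict) -> int:
        """Write new data and a new root to free blocks without publishing."""
        if self.fsinfo_root is None:
            new_root = {}
        else:
            new_root = dict(self.blocks[self.fsinfo_root])
        for name, data in changes.items():
            new_root[name] = self._allocate(data)
        return self._allocate(new_root)

    def commit(self, new_root_blkno: int) -> None:
        """Consistency point: one update of the root reference in fsinfo."""
        self.fsinfo_root = new_root_blkno


disk = WriteAnywhereDisk(16)
disk.commit(disk.stage({"a": "v1"}))
pending = disk.stage({"a": "v2"})   # written to new blocks, not yet visible
before = disk.read_state()
disk.commit(pending)
after = disk.read_state()
```

In this sketch a crash between stage() and commit() would leave the old root
in place, so the previous consistent state would still be the one on disk.
Freeing of superseded blocks is deliberately left out.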

The present invention also creates snapshots, which are virtual
read-only copies of the file system. A snapshot uses no disk space when it is
initially created. It is designed so that many different snapshots can be created
for the same file system. Unlike prior art file systems that create a clone by
duplicating the entire inode file and all of the indirect blocks, the present
invention duplicates only the inode that describes the inode file. Thus, the
actual disk space required for a snapshot is only the 128 bytes used to store the
duplicated inode. The 128 bytes of the present invention required for a
snapshot is significantly less than the many megabytes used for a clone in the
prior art.
The present invention prevents new data written to the active file
system from overwriting "old" data that is part of a snapshot(s). It is necessary
that old data not be overwritten as long as it is part of a snapshot. This is
accomplished by using a multi-bit free-block map. Most prior art file systems
use a free block map having a single bit per block to indicate whether or not a
block is allocated. The present invention uses a block map having 32-bit
entries. A first bit indicates whether a block is used by the active file system,
and 20 of the remaining 31 bits are used for up to 20 snapshots; some of
the 31 bits may be used for other purposes.
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a prior art "clone" of a file system.

Figure 2 is a diagram illustrating a list of inodes having dirty buffers.

Figure 3 is a diagram illustrating an on-disk inode of WAFL.

Figures 4A-4D are diagrams illustrating on-disk inodes of WAFL having
different levels of indirection.

Figure 5 is a flow diagram illustrating the method for generating a
consistency point.

Figure 6 is a flow diagram illustrating step 530 of Figure 5 for generating
a consistency point.

Figure 7 is a flow diagram illustrating step 530 of Figure 5 for creating a
snapshot.

Figure 8 is a diagram illustrating an incore inode of WAFL according to
the present invention.

Figures 9A-9D are diagrams illustrating incore inodes of WAFL having
different levels of indirection according to the present invention.

Figure 10 is a diagram illustrating an incore inode 1020 for a file.

Figures 11A-11D are diagrams illustrating a block map (blkmap) file
according to the present invention.

Figure 12 is a diagram illustrating an inode file according to the present
invention.

Figures 13A-13B are diagrams illustrating an inode map (inomap) file
according to the present invention.

Figure 14 is a diagram illustrating a directory according to the present
invention.

Figure 15 is a diagram illustrating a file system information (fsinfo)
structure.

Figure 16 is a diagram illustrating the WAFL file system.

Figures 17A-17L are diagrams illustrating the generation of a consistency
point.

Figures 18A-18C are diagrams illustrating generation of a snapshot.

Figure 19 is a diagram illustrating changes to an inode file.

Figure 20 is a diagram illustrating fsinfo blocks used for maintaining a
file system in a consistent state.

Figures 21A-21F are detailed diagrams illustrating generations of a
snapshot.

Figure 22 is a diagram illustrating an active WAFL file system having
three snapshots that each reference a common file; and,

Figures 23A-23B are diagrams illustrating the updating of atime.
DETAILED DESCRIPTION OF THE PRESENT INVENTION

A system for creating read-only copies of a file system is described. In
the following description, numerous specific details, such as number and
nature of disks, disk block sizes, etc., are described in detail in order to provide
a more thorough description of the present invention. It will be apparent,
however, to one skilled in the art, that the present invention may be practiced
without these specific details. In other instances, well-known features have
not been described in detail so as not to unnecessarily obscure the present
invention.

WRITE ANYWHERE FILE-SYSTEM LAYOUT

The present invention uses a Write Anywhere File-system Layout
(WAFL). This disk format system is block based (i.e., 4 KB blocks that have no
fragments), uses inodes to describe its files, and includes directories that are
simply specially formatted files. WAFL uses files to store meta-data that
describes the layout of the file system. WAFL meta-data files include: an
inode file, a block map (blkmap) file, and an inode map (inomap) file. The
inode file contains the inode table for the file system. The blkmap file
indicates which disk blocks are allocated. The inomap file indicates which
inodes are allocated. On-disk and incore WAFL inode distinctions are
discussed below.



On-disk WAFL Inodes

WAFL inodes are distinct from prior art inodes. Each on-disk WAFL
inode points to 16 blocks having the same level of indirection. A block
number is 4 bytes long. Use of block numbers having the same level of
indirection in an inode better facilitates recursive processing of a file. Figure 3
is a block diagram illustrating an on-disk inode 310. The on-disk inode 310 is
comprised of standard inode information 310A and 16 block number entries
310B having the same level of indirection. The inode information 310A
comprises information about the owner of a file, permissions, file size, access
time, etc. that are well-known to a person skilled in the art. On-disk inode 310
is unlike prior art inodes that comprise a plurality of block numbers having
different levels of indirection. Keeping all block number entries 310B in an
inode 310 at the same level of indirection simplifies file system
implementation.



For a small file having a size of 64 bytes or less, data is stored directly in
the inode itself instead of the 16 block numbers. Figure 4A is a diagram
illustrating a Level 0 inode 410 that is similar to inode 310 shown in Figure 3.
However, inode 410 comprises 64 bytes of data 410B instead of 16 block
numbers 310B. Therefore, disk blocks do not need to be allocated for very
small files.

For a file having a size of less than 64 KB, each of the 16 block numbers
directly references a 4 KB data block. Figure 4B is a diagram illustrating a Level
1 inode 310 comprising 16 block numbers 310B. The block number entries 0-15
point to corresponding 4 KB data blocks 420A-420C.

For a file having a size that is greater than or equal to 64 KB and is less
than 64 MB, each of the 16 block numbers references a single-indirect block. In
turn, each 4 KB single-indirect block comprises 1024 block numbers that
reference 4 KB data blocks. Figure 4C is a diagram illustrating a Level 2 inode
310 comprising 16 block numbers 310B that reference 16 single-indirect blocks
430A-430C. As shown in Figure 4C, block number entry 0 points to
single-indirect block 430A. Single-indirect block 430A comprises 1024 block
numbers that reference 4 KB data blocks 440A-440C. Similarly, single-indirect
blocks 430B-430C can each address up to 1024 data blocks.

For a file size greater than 64 MB, the 16 block numbers of the inode
reference double-indirect blocks. Each 4 KB double-indirect block comprises
1024 block numbers pointing to corresponding single-indirect blocks. In turn,
each single-indirect block comprises 1024 block numbers that point to 4 KB data
blocks. Thus, up to 64 GB can be addressed. Figure 4D is a diagram illustrating
a Level 3 inode 310 comprising 16 block numbers 310B wherein block number
entries 0, 1, and 15 reference double-indirect blocks 470A, 470B, and 470C,
respectively. Double-indirect block 470A comprises 1024 block number entries
0-1023 that point to 1024 single-indirect blocks 480A-480B. Each single-indirect
block 480A-480B, in turn, references 1024 data blocks. As shown in Figure 4D,
single-indirect block 480A references 1024 data blocks 490A-490C and
single-indirect block 480B references 1024 data blocks 490C-490F.
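
The size thresholds described above follow directly from the block geometry
(16 entries per inode, 1024 block numbers per 4 KB indirect block). The
following sketch reproduces them. The function names are hypothetical, and
placing a file of exactly 64 MB at Level 3 is an assumption, since the text
only covers "less than 64 MB" and "greater than 64 MB".

```python
BLOCK_SIZE = 4 * 1024        # 4 KB blocks
ENTRIES_PER_INODE = 16       # block numbers held in one inode
ENTRIES_PER_INDIRECT = 1024  # 4-byte block numbers in a 4 KB indirect block
INLINE_LIMIT = 64            # bytes of data stored directly in a Level 0 inode


def level_capacity(level: int) -> int:
    """Bytes addressable when all 16 inode entries sit at the given level."""
    if level == 0:
        return INLINE_LIMIT
    return ENTRIES_PER_INODE * ENTRIES_PER_INDIRECT ** (level - 1) * BLOCK_SIZE


def inode_level(file_size: int) -> int:
    """Indirection level used for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file size cannot be negative")
    if file_size <= INLINE_LIMIT:
        return 0
    if file_size < level_capacity(1):    # less than 64 KB
        return 1
    if file_size < level_capacity(2):    # 64 KB up to, not including, 64 MB
        return 2
    if file_size <= level_capacity(3):   # up to 64 GB (64 MB exactly: assumed)
        return 3
    raise ValueError("file larger than the 64 GB addressable at Level 3")


print(level_capacity(1) // 1024)        # 64        (KB)
print(level_capacity(2) // 1024 ** 2)   # 64        (MB)
print(level_capacity(3) // 1024 ** 3)   # 64        (GB)
```

Because every entry in an inode is at the same level, a single number is
enough to describe how the whole file is reached.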



Incore WAFL Inodes

Figure 8 is a block diagram illustrating an incore WAFL inode 820. The
incore inode 820 comprises the information of on-disk inode 310 (shown in
Figure 3), a WAFL buffer data structure 820A, and 16 buffer pointers 820B. A
WAFL incore inode has a size of 300 bytes. A WAFL buffer is an incore (in
memory) 4 KB equivalent of the 4 KB blocks that are stored on disk. Incore
inode 820 is unlike prior art inodes that reference buffers having different
levels of indirection. Each incore WAFL inode 820 points to 16 buffers having
the same level of indirection. A buffer pointer is 4 bytes long. Keeping all
buffer pointers 820B in an inode 820 at the same level of indirection simplifies
file system implementation. Incore inode 820 also contains incore
information 820C comprising a dirty flag, an in-consistency point (IN_CP) flag,
and pointers for a linked list. The dirty flag indicates that the inode itself has
been modified or that it references buffers that have changed. The IN_CP flag
is used to mark an inode as being in a consistency point (described below). The
pointers for a linked list are described below.

Figure 10 is a diagram illustrating a file referenced by a WAFL inode
1010. The file comprises indirect WAFL buffers 1020-1024 and direct WAFL
buffers 1030-1034. The WAFL in-core inode 1010 comprises standard inode
information 1010A (including a count of dirty buffers), a WAFL buffer data
structure 1010B, 16 buffer pointers 1010C and a standard on-disk inode 1010D.
The in-core WAFL inode 1010 has a size of approximately 300 bytes. The
on-disk inode is 128 bytes in size. The WAFL buffer data structure 1010B
comprises two pointers where the first one references the 16 buffer pointers
1010C and the second references the on-disk block numbers 1010D.

Each inode 1010 has a count of dirty buffers that it references. An inode
1010 can be put in the list of dirty inodes and/or the list of inodes that have
dirty buffers. When all dirty buffers referenced by an inode are either
scheduled to be written to disk or are written to disk, the count of dirty buffers
to inode 1010 is set to zero. The inode 1010 is then requeued according to its
flag (i.e., no dirty buffers). This inode 1010 is cleared before the next inode is
processed. Further, the flag of the inode indicating that it is in a consistency
point is cleared. The inode 1010 itself is written to disk in a consistency point.

The WAFL buffer structure is illustrated by indirect WAFL buffer 1020.
WAFL buffer 1020 comprises a WAFL buffer data structure 1020A, a 4 KB
buffer 1020B comprising 1024 WAFL buffer pointers and a 4 KB buffer 1020C
comprising 1024 on-disk block numbers. The WAFL buffer data structure is 56
bytes in size and comprises 2 pointers. One pointer of WAFL buffer data
structure 1020A references 4 KB buffer 1020B and a second pointer references
buffer 1020C. In Figure 10, the 16 buffer pointers 1010C of WAFL inode 1010
point to the 16 single-indirect WAFL buffers 1020-1024. In turn, WAFL buffer
1020 references 1024 direct WAFL buffer structures 1030-1034. WAFL buffer
1030 is representative of direct WAFL buffers.

Direct WAFL buffer 1030 comprises WAFL buffer data structure 1030A
and a 4 KB direct buffer 1030B containing a cached version of a corresponding
on-disk 4 KB data block. Direct WAFL buffer 1030 does not comprise a 4 KB
buffer such as buffer 1020C of indirect WAFL buffer 1020. The second buffer
pointer of WAFL buffer data structure 1030A is zeroed, and therefore does not
point to a second 4 KB buffer. This prevents inefficient use of memory because
memory space would be assigned for an unused buffer otherwise.

In the WAFL file system as shown in Figure 10, a WAFL in-core inode
structure 1010 references a tree of WAFL buffer structures 1020-1024 and
1030-1034. It is similar to a tree of blocks on disk referenced by standard inodes
comprising block numbers that point to indirect and/or direct blocks. Thus,
WAFL inode 1010 contains not only the on-disk inode 1010D comprising 16
volume block numbers, but also comprises 16 buffer pointers 1010C pointing to
WAFL buffer structures 1020-1024 and 1030-1034. WAFL buffers 1030-1034
contain cached contents of blocks referenced by volume block numbers.

The WAFL in-core inode 1010 contains 16 buffer pointers 1010C. In
turn, the 16 buffer pointers 1010C are referenced by a WAFL buffer structure
1010B that roots the tree of WAFL buffers 1020-1024 and 1030-1034. Thus, each
WAFL inode 1010 contains a WAFL buffer structure 1010B that points to the 16
buffer pointers 1010C in the inode 1010. This facilitates algorithms for
handling trees of buffers that are implemented recursively. If the 16 buffer
pointers 1010C in the inode 1010 were not represented by a WAFL buffer
structure 1010B, the recursive algorithms for operating on an entire tree of
buffers 1020-1024 and 1030-1034 would be difficult to implement.
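
The benefit of rooting the tree in a buffer structure can be seen in a small
recursive routine. The sketch below is an illustration under stated
assumptions, not the WAFL data layout: the classes WaflBuffer and IncoreInode
and the function count_dirty are hypothetical, and a Python list stands in for
the block of buffer pointers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WaflBuffer:
    """Hypothetical stand-in for a WAFL buffer structure."""
    children: List[Optional["WaflBuffer"]] = field(default_factory=list)
    data: Optional[bytes] = None   # set only on direct buffers
    dirty: bool = False


@dataclass
class IncoreInode:
    # The inode embeds a buffer structure that roots the tree, so its 16
    # pointers are handled exactly like the pointers of an indirect buffer.
    root: WaflBuffer = field(default_factory=WaflBuffer)


def count_dirty(buf: Optional[WaflBuffer]) -> int:
    """Count dirty buffers in a tree; the inode needs no special case."""
    if buf is None:
        return 0
    return int(buf.dirty) + sum(count_dirty(child) for child in buf.children)


clean_leaf = WaflBuffer(data=b"unchanged")
dirty_leaf = WaflBuffer(data=b"changed", dirty=True)
indirect = WaflBuffer(children=[clean_leaf, dirty_leaf, None], dirty=True)
inode = IncoreInode(root=WaflBuffer(children=[indirect] + [None] * 15))

dirty_total = count_dirty(inode.root)
```

The same function works whether it is handed the root held in the inode or
any indirect buffer further down, which is the point made in the paragraph
above.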

Figures 9A-9D are diagrams illustrating inodes having different levels of
indirection. In Figures 9A-9D, simplified indirect and direct WAFL buffers are
illustrated to show indirection. However, it should be understood that the
WAFL buffers of Figure 9 represent corresponding indirect and direct buffers of
Figure 10. For a small file having a size of 64 bytes or less, data is stored directly
in the inode itself instead of the 16 buffer pointers. Figure 9A is a diagram
illustrating a Level 0 inode 820 that is the same as inode 820 shown in Figure 8
except that inode 820 comprises 64 bytes of data 920B instead of 16 buffer
pointers 820B. Therefore, additional buffers are not allocated for very small
files.



For a file having a size of less than 64 KB, each of the 16 buffer pointers
directly references a 4 KB direct WAFL buffer. Figure 9B is a diagram
illustrating a Level 1 inode 820 comprising 16 buffer pointers 820B. The buffer
pointers PTR0-PTR15 point to corresponding 4 KB direct WAFL buffers
922A-922C.

For a file having a size that is greater than or equal to 64 KB and is less
than 64 MB, each of the 16 buffer pointers references a single-indirect WAFL
buffer. In turn, each 4 KB single-indirect WAFL buffer comprises 1024 buffer
pointers that reference 4 KB direct WAFL buffers. Figure 9C is a diagram
illustrating a Level 2 inode 820 comprising 16 buffer pointers 820B that
reference 16 single-indirect WAFL buffers 930A-930C. As shown in Figure 9C,
buffer pointer PTR0 points to single-indirect WAFL buffer 930A.
Single-indirect WAFL buffer 930A comprises 1024 pointers that reference 4 KB
direct WAFL buffers 940A-940C. Similarly, single-indirect WAFL buffers
930B-930C can each address up to 1024 direct WAFL buffers.

For a file size greater than 64 MB, the 16 buffer pointers of the inode
reference double-indirect WAFL buffers. Each 4 KB double-indirect WAFL
buffer comprises 1024 pointers pointing to corresponding single-indirect
WAFL buffers. In turn, each single-indirect WAFL buffer comprises 1024
pointers that point to 4 KB direct WAFL buffers. Thus, up to 64 GB can be
addressed. Figure 9D is a diagram illustrating a Level 3 inode 820 comprising
16 pointers 820B wherein pointers PTR0, PTR1, and PTR15 reference
double-indirect WAFL buffers 970A, 970B, and 970C, respectively.
Double-indirect WAFL buffer 970A comprises 1024 pointers that point to 1024
single-indirect WAFL buffers 980A-980B. Each single-indirect WAFL buffer
980A-980B, in turn, references 1024 direct WAFL buffers. As shown in Figure
9D, single-indirect WAFL buffer 980A references 1024 direct WAFL buffers
990A-990C and single-indirect WAFL buffer 980B references 1024 direct WAFL
buffers 990D-990F.



Directories

Directories in the WAFL system are stored in 4 KB blocks that are
divided into two sections. Figure 14 is a diagram illustrating a directory block
1410 according to the present invention. Each directory block 1410 comprises a
first section 1410A comprising fixed length directory entry structures 1412-1414
and a second section 1410B containing the actual directory names 1416-1418.
Each directory entry also contains a file id and a generation. This information
identifies what file the entry references. This information is well-known in
the art, and therefore is not illustrated in Figure 14. Each entry 1412-1414 in the
first section 1410A of the directory block has a pointer to its name in the second
section 1410B. Further, each entry 1412-1414 includes a hash value dependent
upon its name in the second section 1410B so that the name is examined only
when a hash hit (a hash match) occurs. For example, entry 1412 of the first
section 1410A comprises a hash value 1412A and a pointer 1412B. The hash
value 1412A is a value dependent upon the directory name
"DIRECTORY_ABC" stored in variable length entry 1416 of the second section
1410B. Pointer 1412B of entry 1412 points to the variable length entry 1416 of
second section 1410B. Using fixed length directory entries 1412-1414 in the first
section 1410A speeds up the process of name lookup. A calculation is not
required to find the next entry in a directory block 1410. Further, keeping
entries 1412-1414 in first section 1410A small improves the hit rate for file
systems with a line-fill data cache.
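
The two-section directory block can be sketched as follows. This is a minimal
illustration, not the on-disk format: the class and field names are
hypothetical, Python lists stand in for the two sections of the 4 KB block,
and CRC-32 is used as the hash only because the text does not name the hash
function.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


def name_hash(name: str) -> int:
    # Assumption: CRC-32 is a stand-in; the source does not specify the hash.
    return zlib.crc32(name.encode("utf-8"))


@dataclass
class DirEntry:
    """Fixed length entry kept in the first section of the block."""
    hash_value: int
    name_index: int    # plays the role of the pointer into the second section
    file_id: int
    generation: int


class DirectoryBlock:
    def __init__(self) -> None:
        self.entries: List[DirEntry] = []   # first section, fixed length
        self.names: List[str] = []          # second section, variable length
        self.name_compares = 0              # counts names actually examined

    def add(self, name: str, file_id: int, generation: int = 0) -> None:
        self.names.append(name)
        index = len(self.names) - 1
        self.entries.append(DirEntry(name_hash(name), index, file_id, generation))

    def lookup(self, name: str) -> Optional[int]:
        wanted = name_hash(name)
        for entry in self.entries:          # fixed stride, no length arithmetic
            if entry.hash_value != wanted:
                continue                    # hash miss: name is not examined
            self.name_compares += 1
            if self.names[entry.name_index] == name:
                return entry.file_id
        return None


block = DirectoryBlock()
block.add("DIRECTORY_ABC", file_id=7)
block.add("notes.txt", file_id=8)
found = block.lookup("DIRECTORY_ABC")
```

A lookup walks only the small fixed length entries and touches the name
section when a hash matches, which is why the first section is kept small.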



Meta-Data

WAFL keeps information that describes a file system in files known as
meta-data. Meta-data comprises an inode file, inomap file, and a blkmap file.
WAFL stores its meta-data in files that may be written anywhere on a disk.
Because all WAFL meta-data is kept in files, it can be written to any location
just like any other file in the file system.

A first meta-data file is the "inode file" that contains inodes describing
all other files in the file system. Figure 12 is a diagram illustrating an
inode file 1210. The inode file 1210 may be written anywhere on a disk,
unlike prior art systems. The inode file 1210 contains an inode for each
file in the file system except for the inode file 1210 itself. The inode
file 1210 is pointed to by an inode referred to as the "root inode". The
root inode is kept in a fixed location on disk referred to as the file
system information (fsinfo) block described below. The inode file 1210
itself is stored in 4 KB blocks on disk (or 4 KB buffers in memory). Figure
12 illustrates that inodes 1210A-1210C are stored in a 4 KB buffer 1220. For
on-disk inode sizes of 128 bytes, a 4 KB buffer (or block) comprises 32
inodes. The incore inode file 1210 is composed of WAFL buffers 1220. When an
incore inode (i.e., 1210A) is loaded, the on-disk inode part of the incore
inode 1210A is copied in from the buffer 1220 of the inode file 1210. The
buffer data itself is loaded from disk. Writing data to disk is done in the
reverse order. The incore inode 1210A, which is a copy of the on-disk inode,
is copied to the corresponding buffer 1220 of the inode file 1210. Then, the
inode file 1210 is write-allocated, and the data stored in the buffer 1220
of the inode file 1210 is written to disk.
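
The block arithmetic implied by these sizes can be written out as a short
sketch. It assumes, purely for illustration, that inodes are numbered from
zero and packed contiguously in the inode file; the specification does not
state the numbering scheme, and `locate_inode` is a hypothetical helper.

```python
BLOCK_SIZE = 4096   # 4 KB block (or buffer)
INODE_SIZE = 128    # on-disk inode size in bytes
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE


def locate_inode(inode_number):
    """Return (block index in the inode file, byte offset within that block).

    Assumes zero-based, contiguously packed inodes (an illustrative choice).
    """
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return block_index, slot * INODE_SIZE
```

Under these assumptions, inode 33 would sit in the second block of the
inode file, one inode (128 bytes) from the start of that block.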

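Another meta-data file is the "block map" (blkmap) file. Figure 11A is a
diagram illustrating a blkmap file 1110. The blkmap file 1110 contains a
32-bit entry 1110A-1110C for each 4 KB block in the disk system. It also
serves as a free-block map file. The blkmap file 1110 indicates whether or
not a disk block has been allocated. Figure 11B is a diagram of a block
entry 1110A of blkmap file 1110 (shown in Figure 11A). As shown in Figure
11B, entry 1110A is comprised of 32 bits (BIT0-BIT31). Bit 0 (BIT0) of entry
1110A is the active file system bit (FS-BIT). The FS-bit of entry 1110A
indicates whether or not the corresponding block is part of the active file
system. Bits 1-20 (BIT1-BIT20) of entry 1110A are bits that indicate whether
the block is part of a corresponding snapshot 1-20. The next upper 10 bits
(BIT21-BIT30) are reserved. Bit 31 (BIT31) is the consistency point bit
(CP-BIT) of entry 1110A.

A block is available as a free block in the file system when all bits
(BIT0-BIT31) in the 32-bit entry 1110A for the block are clear (reset to a
value of 0). Figure 11C is a diagram illustrating entry 1110A of Figure 11A
indicating that the disk block is free. Thus, the block referenced by entry
1110A of blkmap file 1110 is free when bits 0-31 (BIT0-BIT31) all have
values of 0. Figure 11D is a diagram illustrating entry 1110A of Figure 11A
indicating an allocated block in the active file system. When bit 0 (BIT0),
also referred to as the FS-bit, is set to a value of 1, the entry 1110A of
blkmap file 1110 indicates a block that is part of the active file system.
Bits 1-20 (BIT1-BIT20) are used to indicate corresponding snapshots, if any,
that reference the block. Snapshots are described in detail below. If bit 0
(BIT0) is set to a value of 0, this does not necessarily indicate that the
block is available for allocation. All the snapshot bits must also be zero
for the block to be allocated. Bit 31 (BIT31) of entry 1110A always has the
same state as bit 0 (BIT0) on disk; however, when loaded into memory, bit 31
(BIT31) is used for bookkeeping as part of a consistency point.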
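
The bit layout of a blkmap entry lends itself to a few mask operations. The
following is a minimal sketch; the constant and function names are
illustrative only and do not come from the specification.

```python
FS_BIT = 1 << 0          # bit 0: block belongs to the active file system
CP_BIT = 1 << 31         # bit 31: consistency point bit
MAX_SNAPSHOTS = 20       # bits 1-20: one bit per snapshot
SNAPSHOT_MASK = ((1 << MAX_SNAPSHOTS) - 1) << 1


def snapshot_bit(n):
    """Mask for snapshot n, where n is in 1..20."""
    if not 1 <= n <= MAX_SNAPSHOTS:
        raise ValueError("snapshot number must be in 1..20")
    return 1 << n


def is_free(entry):
    """A block is free only when all 32 bits of its entry are clear."""
    return entry == 0


def in_active_fs(entry):
    return bool(entry & FS_BIT)


def snapshots_referencing(entry):
    """List the snapshot numbers whose bits are set in this entry."""
    return [n for n in range(1, MAX_SNAPSHOTS + 1) if entry & snapshot_bit(n)]
```

Note that a cleared FS-bit alone does not make a block free: an entry with
any snapshot bit still set is non-zero, so `is_free` returns false for it.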

Another meta-data file is the "inode map" (inomap) file that serves as a
free inode map. Figure 13A is a diagram illustrating an inomap file 1310.
The inomap file 1310 contains an 8-bit entry 1310A-1310C for each block in
the inode file 1210 shown in Figure 12. Each entry 1310A-1310C is a count of
allocated inodes in the corresponding block of the inode file 1210. Figure
13A shows values of 32, 5, and 0 in entries 1310A-1310C, respectively. The
inode file 1210 must still be inspected to find which inodes in the block
are free, but this does not require large numbers of random blocks to be
loaded into memory from disk. Since each 4 KB block 1220 of inode file 1210
holds 32 inodes, the 8-bit inomap entry 1310A-1310C for each block of inode
file 1210 can have values ranging from 0 to 32. When a block 1220 of an
inode file 1210 has no inodes in use, the entry 1310A-1310C for it in inomap
file 1310 is 0. When all the inodes in the block 1220 of inode file 1210 are
in use, the entry 1310A-1310C of the inomap file 1310 has a value of 32.

Figure 13B is a diagram illustrating an inomap file 1350 that references
the 4 KB blocks 1340A-1340C of inode file 1340. For example, inode file 1340
stores 37 inodes in three 4 KB blocks 1340A-1340C. Blocks 1340A-1340C of
inode file 1340 contain 32, 5, and 0 used inodes, respectively. Entries
1350A-1350C of inomap file 1350 reference blocks 1340A-1340C of inode file
1340, respectively. Thus, the entries 1350A-1350C of inomap file 1350 have
values of 32, 5, and 0 for blocks 1340A-1340C of inode file 1340. In turn,
entries 1350A-1350C of inomap file 1350 indicate 0, 27, and 32 free inodes
in blocks 1340A-1340C of inode file 1340, respectively.
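
The relationship between used counts and free inodes in this example can be
checked with a short sketch. Representing each inode-file block as a list of
in-use flags is an assumption made for illustration, and the helper names
are hypothetical.

```python
INODES_PER_BLOCK = 32


def build_inomap(inode_blocks):
    """One 8-bit count of allocated inodes per block of the inode file."""
    return bytes(sum(1 for used in block if used) for block in inode_blocks)


def free_counts(inomap):
    """Free inodes per block, derived from the allocated counts."""
    return [INODES_PER_BLOCK - count for count in inomap]


def first_block_with_free_inode(inomap):
    """Pick a block to inspect without loading the inode file blocks."""
    for index, count in enumerate(inomap):
        if count < INODES_PER_BLOCK:
            return index
    return None


# The 37-inode example: blocks holding 32, 5, and 0 used inodes.
example_blocks = [
    [True] * 32,
    [True] * 5 + [False] * 27,
    [False] * 32,
]
example_inomap = build_inomap(example_blocks)
```

Only after the inomap has pointed at a block with spare capacity does the
inode file block itself need to be read to find which slot is free.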

Referring to Figure 13, using a bitmap for the entries 1310A-1310C of
inomap file 1310 instead of counts is disadvantageous since it would require
4 bytes per entry 1310A-1310C for block 1220 of the inode file 1210 (shown
in Figure 12) instead of one byte. Free inodes in the block(s) 1220 of the
inode file 1210 do not need to be indicated in the inomap file 1310 because
the inodes themselves contain that information.

Figure 15 is a diagram illustrating a file system information (fsinfo)
structure 1510. The root inode 1510B of a file system is kept in a fixed
location on disk so that it can be located during booting of the file
system. The fsinfo block is not a meta-data file but is part of the WAFL
system. The root inode 1510B is an inode referencing the inode file 1210. It
is part of the file system information (fsinfo) structure 1510 that also
contains information 1510A including the number of blocks in the file
system, the creation time of the file system, etc. The miscellaneous
information 1510A further comprises a checksum 1510C (described below).
Except for the root inode 1510B itself, this information 1510A can be kept
in a meta-data file in an alternate embodiment. Two identical copies of the
fsinfo structure 1510 are kept in fixed locations on disk.



Figure 16 is a diagram illustrating the WAFL file system 1670 in a
consistent state on disk comprising two fsinfo blocks 1610 and 1612, inode
file 1620, blkmap file 1630, inomap file 1640, root directory 1650, and a
typical file (or directory) 1660. Inode file 1620 is comprised of a
plurality of inodes 1620A-1620D that reference other files 1630-1660 in the
file system 1670. Inode 1620A of inode file 1620 references blkmap file
1630. Inode 1620B references inomap file 1640. Inode 1620C references root
directory 1650. Inode 1620D references a typical file (or directory) 1660.
Thus, the inode file points to all files 1630-1660 in the file system 1670
except for fsinfo blocks 1610 and 1612. Fsinfo blocks 1610 and 1612 each
contain a copy 1610B and 1612B of the inode of the inode file 1620,
respectively. Because the root inode 1610B and 1612B of fsinfo blocks 1610
and 1612 describes the inode file 1620, which in turn describes the rest of
the files 1630-1660 in the file system 1670 including all meta-data files
1630-1640, the root inode 1610B and 1612B is viewed as the root of a tree of
blocks. The WAFL system 1670 uses this tree structure for its update method
(consistency point) and for implementing snapshots, both described below.

List of Inodes Having Dirty Blocks

WAFL in-core inodes (i.e., WAFL inode 1010 shown in Figure 10) of the
WAFL file system are maintained in different linked lists according to their
status. Inodes that reference dirty blocks are kept in a dirty inode list as
shown in Figure 2. Inodes containing valid data that is not dirty are kept
in a separate list and inodes that have no valid data are kept in yet
another, as is well-known in the art. The present invention utilizes a list
of inodes having dirty data blocks that facilitates finding all of the
inodes that need write allocations to be done.

Figure 2 is a diagram illustrating a list 210 of dirty inodes according to
the present invention. The list 210 of dirty inodes comprises WAFL in-core
inodes 220-250. As shown in Figure 2, each WAFL in-core inode 220-250
comprises a pointer 220A-250A, respectively, that points to another inode in
the linked list. For example, WAFL inodes 220-250 are stored in memory at
locations 2048, 2152, 2878, 3448 and 3712, respectively. Thus, pointer 220A
of inode 220 contains address 2152. It points therefore to WAFL inode 222.
In turn, WAFL inode 222 points to WAFL inode 230 using address 2878. WAFL
inode 230 points to WAFL inode 240. WAFL inode 240 points to inode 250. The
pointer 250A of WAFL inode 250 contains a null value and therefore does not
point to another inode. Thus, it is the last inode in the list 210 of dirty
inodes. Each inode in the list 210 represents a file comprising a tree of
buffers as depicted in Figure 10. At least one of the buffers referenced by
each inode 220-250 is a dirty buffer. A dirty buffer contains modified data
that must be written to a new disk location in the WAFL system. WAFL always
writes dirty
buffers to new locations on disk.

The WAFL disk structure described so far is static. In the present
invention, changes to the file system 1670 are tightly controlled to
maintain the file system 1670 in a consistent state. The file system 1670
progresses from one self-consistent state to another self-consistent state.
The set (or tree) of self-consistent blocks on disk that is rooted by the
root inode 1510B is referred to as a consistency point (CP). To implement
consistency points, WAFL always writes new data to unallocated blocks on
disk. It never overwrites existing data. Thus, as long as the root inode
1510B is not updated, the state of the file system 1670 represented on disk
does not change. However, for a file system 1670 to be useful, it must
eventually refer to newly written data; therefore a
new consistency point must be written.
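
The effect of never overwriting data and updating the root last can be
shown with a toy model. Everything here is an assumption made for
illustration: `ToyDisk`, the flat name-to-block table standing in for the
tree of blocks, and the `root` attribute standing in for the root inode are
hypothetical and much simpler than the structures described above.

```python
class ToyDisk:
    def __init__(self):
        self.blocks = {}      # block number -> contents, never overwritten
        self.next_free = 0
        self.root = None      # block number of the current tree (root inode stand-in)

    def write_new(self, contents):
        """Write to a previously unallocated block, never in place."""
        number = self.next_free
        self.next_free += 1
        self.blocks[number] = contents
        return number

    def read_tree(self):
        """Return the file system state reachable from the current root."""
        if self.root is None:
            return {}
        table = self.blocks[self.root]
        return {name: self.blocks[number] for name, number in table.items()}


def prepare_consistency_point(disk, changes):
    """Flush changed data and a new table to new blocks; return the new root.

    The caller installs the returned block number as the root as a final,
    separate step, which is what makes the new state visible.
    """
    table = dict(disk.blocks[disk.root]) if disk.root is not None else {}
    for name, data in changes.items():
        table[name] = disk.write_new(data)
    return disk.write_new(table)
```

Until the returned block number is installed as `disk.root`, `read_tree`
keeps returning the previous state, and the blocks of that previous state
remain on the toy disk afterwards as well.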



Referring to Figure 16, a new consistency point is written by first
flushing all file system blocks to new locations on disk (including the
blocks in meta-data files such as the inode file 1620, blkmap file 1630, and
inomap file 1640). A new root inode 1610B and 1612B for the file system 1670
is then written to disk. With this method for atomically updating a file
system, the on-disk file system is never inconsistent. The on-disk file
system 1670 reflects an old consistency point up until the root inode 1610B
and 1612B is written. Immediately after the root inode 1610B and 1612B is
written to disk, the file system 1670 reflects a new consistency point. Data
structures of the file system 1670 can be updated in any order, and there
are no ordering constraints on disk writes except the one requirement that
all blocks in the file system 1670 must be written to disk before the root
inode 1610B and 1612B is updated.

To convert to a new consistency point, the root inode 1610B and 1612B must
be updated reliably and atomically. WAFL does this by keeping two identical
copies of the fsinfo structure 1610 and 1612 containing the root inode 1610B
and 1612B. During updating of the root inode 1610B and 1612B, a first copy
of the fsinfo structure 1610 is written to disk, and then the second copy of
the fsinfo structure 1612 is written. A checksum 1610C and 1612C in the
fsinfo structure 1610 and 1612, respectively, is used to detect the
occurrence of a system crash that corrupts one of the copies of the fsinfo
structure 1610 or 1612, each containing a copy of the root inode, as it is
being written to disk. Normally, the two fsinfo structures 1610 and 1612 are
identical.

Algorithm for Generating a Consistency Point

Figure 5 is a diagram illustrating the method of producing a consistency
point. In step 510, all "dirty" inodes (inodes that point to new blocks
containing modified data) in the system are marked as being in the
consistency point. Their contents, and only their contents, are written to
disk. Only when those writes are complete are any writes from other inodes
allowed to reach disk. Further, during the time dirty writes are occurring,
no new modifications can be made to inodes that are in the consistency
point.

In addition to setting the consistency point flag for all dirty inodes that
are part of the consistency point, a global consistency point flag is set so
that user-requested changes behave in a tightly controlled manner. Once the
global consistency point flag is set, user-requested changes are not allowed
to affect inodes that are in the consistency point. Further, only inodes
having a consistency point flag that is set are allocated disk space for
their dirty blocks. Consequently, the state of the file system will be
flushed to disk exactly as it was when the consistency point began.

In step 520, regular files are flushed to disk. Flushing regular files
comprises the steps of allocating disk space for dirty blocks in the regular
files, and writing the corresponding WAFL buffers to disk. The inodes
themselves are then flushed (copied) to the inode file. All inodes that need
to be written are in either the list of inodes having dirty buffers or the
list of inodes that are dirty but do not have dirty buffers. When step 520
is completed, there are no more ordinary inodes in the consistency point,
and all incoming I/O requests succeed unless the requests use buffers that
are still locked up for disk I/O operations.

In step 530, special files are flushed to disk. Flushing special files
comprises the steps of allocating disk space for dirty blocks in the two
special files (the inode file and the blkmap file), updating the consistency
bit (CP-bit) to match the active file system bit (FS-bit) for each entry in
the blkmap file, and then writing the blocks to disk. Write allocating the
inode file and the blkmap file is complicated because the process of write
allocating them changes the files themselves. Thus, in step 530 writes are
disabled while changing these files to prevent important blocks from locking
up in disk I/O operations before the changes are completed.

Also, in step 530, the creation and deletion of snapshots, described below,
are performed because it is the only point in time when the file system,
except for the fsinfo block, is completely self consistent and about to be
written to disk. A snapshot is deleted from the file system before a new one
is created so that the same snapshot inode can be used in one pass.

Figure 6 is a flow diagram illustrating the steps that step 530 comprises.
Step 530 allocates disk space for the blkmap file and the inode file and
copies the active FS-bit into the CP-bit for each entry in the blkmap file.
In step 610, the inode for the blkmap file is pre-flushed to the inode file.
This ensures that the block in the inode file that contains the inode of the
blkmap file is dirty so that step 620 allocates disk space for it.

In step 620, disk space is allocated for all dirty blocks in the inode and
blkmap files. The dirty blocks include the block in the inode file
containing the inode of the blkmap file.

In step 630, the inode for the blkmap file is flushed again; however, this
time the actual inode is written to the pre-flushed block in the inode file.
Step 610 has already dirtied the block of the inode file that contains the
inode of the blkmap file. Thus, another write-allocate, as in step 620, does
not need to be scheduled.

In step 640, the entries for each block in the blkmap file are updated.
Each entry is updated by copying the active FS-bit to the CP-bit (i.e.,
copying bit 0 into bit 31) for all entries in dirty blocks in the blkmap
file.

In step 650, all dirty blocks in the blkmap and inode files are written to
disk.

Only entries in dirty blocks of the blkmap file need to have the active
file system bit (FS-bit) copied to the consistency point bit (CP-bit) in
step 640. Immediately after a consistency point, all blkmap entries have the
same value for both the active FS-bit and CP-bit. As time progresses, some
active FS-bits of blkmap file entries for the file system are either cleared
or set. The blocks of the blkmap file containing the changed FS-bits are
accordingly marked dirty. During the following consistency point, blocks
that are clean do not need to be re-copied. The clean blocks are not copied
because they were not dirty at the previous consistency point and nothing in
the blocks has changed since then. Thus, as long as the file system is
initially created with the active FS-bit and the CP-bit having the same
value in all blkmap entries, only entries in dirty
blocks need to be updated at each consistency point.
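
Step 640 restricted to dirty blkmap blocks might look like the sketch
below. Modelling the blkmap file as a list of blocks, each a list of 32-bit
integers, and tracking dirty blocks as a set of indexes are assumptions made
for illustration; the function names are hypothetical.

```python
FS_BIT = 1 << 0     # bit 0: active file system bit
CP_BIT = 1 << 31    # bit 31: consistency point bit


def sync_cp_bit(entry):
    """Copy bit 0 into bit 31 of a single 32-bit blkmap entry."""
    if entry & FS_BIT:
        return entry | CP_BIT
    return entry & ~CP_BIT & 0xFFFFFFFF


def update_dirty_blkmap_blocks(blkmap_blocks, dirty):
    """Apply the FS-bit to CP-bit copy only to blocks listed in `dirty`.

    Clean blocks are skipped because their FS-bit and CP-bit already agree
    from the previous consistency point.
    """
    for index in dirty:
        block = blkmap_blocks[index]
        block[:] = [sync_cp_bit(entry) for entry in block]
```

The saving is that clean blkmap blocks are neither rewritten in memory nor
scheduled for another disk write.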



Referring to Figure 5, in step 540, the file system information (fsinfo)
block is updated and then flushed to disk. The fsinfo block is updated by
writing a new root inode for the inode file into it. The fsinfo block is
written twice. It is first written to one location and then to a second
location. The two writes are performed so that when a system crash occurs
during either write, a self-consistent file system exists on disk.
Therefore, either the new consistency point is available if the system
crashed while writing the second fsinfo block or the previous consistency
point (on disk before the recent consistency point began) is available if
the first fsinfo block failed. When the file system is restarted after a
system failure, the highest generation count for a consistency point in the
fsinfo blocks having a correct checksum value is used. This is described in
detail below.



In step 550, the consistency point is completed. This requires that any
dirty inodes that were delayed because they were not part of the consistency
point be requeued. Any inodes that had their state change during the
consistency point are in the consistency point wait (CP_WAIT) queue. The
CP_WAIT queue holds inodes that changed before step 540 completed, but after
step 510 when the consistency point started. Once the consistency point is
completed, the inodes in the CP_WAIT queue are re-queued accordingly in the
regular list of inodes with dirty buffers and list of dirty inodes without
dirty buffers.

Single Ordering Constraint of Consistency Point

The present invention, as illustrated in Figures 20A-20C, has a single
ordering constraint. The single ordering constraint is that the fsinfo block
1810 is written to disk only after all the other blocks are written to disk.
The writing of the fsinfo block 1810 is atomic; otherwise the entire file
system 1830 could be lost. Thus, the WAFL file system requires the fsinfo
block 1810 to be written at once and not be in an inconsistent state. As
illustrated in Figure 15, each of the fsinfo blocks 1810 (1510) contains a
checksum 1510C and a generation count 1510D.

Figure 20A illustrates the updating of the generation count 1810D and
1870D of fsinfo blocks 1810 and 1870. Each time a consistency point (or
snapshot) is performed, the generation count of the fsinfo block is updated.
Figure 20A illustrates two fsinfo blocks 1810 and 1870 having generation
counts 1810D and 1870D, respectively, that have the same value of N
indicating a consistency point for the file system. Both fsinfo blocks
reference the previous consistency point (old file system on disk) 1830. A
new version of the file system exists on disk and is referred to as new
consistency point 1831. The generation count is incremented every
consistency point.

In Figure 20B, the generation count 1810D of the first fsinfo block 1810 is
updated and given a value of N+1. It is then written to disk. Figure 20B
illustrates a value of N+1 for generation count 1810D of fsinfo block 1810
whereas the generation count 1870D of the second fsinfo block 1870 has a
value of N. Fsinfo block 1810 references new consistency point 1831 whereas
fsinfo block 1870 references old consistency point 1830. Next, the
generation count 1870D of fsinfo block 1870 is updated and written to disk
as illustrated in Figure 20C. In Figure 20C, the generation count 1870D of
fsinfo block 1870 has a value of N+1. Therefore the two fsinfo blocks 1810
and 1870 have the same generation count value of N+1.



When a system crash occurs between fsinfo block updates, each copy of the
fsinfo block 1810 and 1870 will have a self consistent checksum (not shown
in the diagram), but one of the generation numbers 1810D or 1870D will have
a higher value. Consider a system crash that occurs when the file system is
in the state illustrated in Figure 20B. In the preferred embodiment of the
present invention as illustrated in Figure 20B, the generation count 1810D
of fsinfo block 1810 is updated before the second fsinfo block 1870.
Therefore, the generation count 1810D (value of N+1) is greater than the
generation count 1870D of fsinfo block 1870. Because the generation count of
the first fsinfo block 1810 is higher, it is selected for recovering the
file system after a system crash. This is done because the first fsinfo
block 1810 contains more current data as indicated by its generation count
1810D. For the case when the first fsinfo block is corrupted because the
system crashes while it is being updated, the other copy 1870 of the fsinfo
block is used to recover the file system 1830
into a consistent state.
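
The selection rule (valid checksum first, then highest generation count)
can be sketched as follows. The on-disk encoding used here is hypothetical:
the specification does not say which checksum is used or how the fields are
laid out, so CRC-32 from Python's `zlib` and the `struct` layout below are
assumptions chosen only to make the example runnable.

```python
import struct
import zlib


def pack_fsinfo(generation, root_inode):
    """Encode a toy fsinfo block: CRC-32, 64-bit generation, root inode bytes."""
    body = struct.pack(">Q", generation) + root_inode
    return struct.pack(">I", zlib.crc32(body)) + body


def unpack_fsinfo(raw):
    """Return (generation, root_inode), or None if the copy is torn or corrupt."""
    if raw is None or len(raw) < 12:
        return None
    (stored,) = struct.unpack(">I", raw[:4])
    body = raw[4:]
    if zlib.crc32(body) != stored:
        return None
    (generation,) = struct.unpack(">Q", body[:8])
    return generation, body[8:]


def recover(copy_a, copy_b):
    """Pick the copy with a correct checksum and the highest generation count."""
    candidates = [c for c in (unpack_fsinfo(copy_a), unpack_fsinfo(copy_b)) if c]
    if not candidates:
        raise ValueError("no valid fsinfo copy found")
    return max(candidates, key=lambda c: c[0])
```

With the first copy at generation N+1 and the second still at N, `recover`
returns the first copy; if the first copy was damaged while being written,
its checksum fails and the second copy is used instead.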

It is not possible for both fsinfo blocks 1810 and 1870 to be updated at the
same time in the present invention. Therefore, at least one good copy of the
fsinfo block 1810 and 1870 exists in the file system. This allows the file
system to always be recovered into a consistent state.

WAFL does not require special recovery procedures. This is unlike prior
art systems that use logging, ordered writes, and mostly ordered writes with
recovery. This is because only data corruption, which RAID protects against,
or software can corrupt a WAFL file system. To avoid losing data when the
system fails, WAFL may keep a non-volatile transaction log of all operations
that have occurred since the most recent consistency point. This log is
completely independent of the WAFL disk format and is required only to
prevent operations from being lost during a system crash. However, it is not
required to maintain consistency of the file system.



Generating A Consistency Point

As described above, changes to the WAFL file system are tightly controlled
to maintain the file system in a consistent state. Figures 17A-17L
illustrate the generation of a consistency point for a WAFL file system. The
generation of a consistency point is described with reference to Figures 5
and 6.

In Figures 17A-17L, buffers that have not been modified do not have
asterisks beside them. Therefore, those buffers contain the same data as the
corresponding on-disk blocks. Thus, a block may be loaded into memory but it
has not changed with respect to its on-disk version. A buffer with a single
asterisk (*) beside it indicates a dirty buffer in memory (its data is
modified). A buffer with a double asterisk (**) beside it indicates a dirty
buffer that has been allocated disk space. Finally, a buffer with a triple
asterisk (***) is a dirty buffer that is written into a new block on disk.
This convention for denoting the state of buffers is also used with respect
to Figures 21A-21E.

Figure 17A illustrates a list 2390 of inodes with dirty buffers comprising
inodes 2306A and 2306B. Inodes 2306A and 2306B reference trees of buffers
where at least one buffer of each tree has been modified. Initially, the
consistency point flags 2391 and 2392 of inodes 2306A and 2306B are cleared
(0). While a list 2390 of inodes with dirty buffers is illustrated for the
present system, it should be obvious to a person skilled in the art that
other lists of inodes may exist in memory. For instance, a list of inodes
that are dirty but do not have dirty buffers is maintained in memory. These
inodes must also be marked as being in the consistency point. They must be
flushed to disk also to write the dirty contents of the inode file to disk
even though the dirty inodes do not reference dirty blocks. This is done in
step 520 of Figure 5.

Figure 17B is a diagram illustrating a WAFL file system of a previous
consistency point comprising fsinfo block 2302, inode file 2346, blkmap file
2344, and files 2340 and 2342. File 2340 comprises blocks 2310-2314
containing data "A", "B", and "C", respectively. File 2342 comprises data
blocks 2316-2320 comprising data "D", "E", and "F", respectively. Blkmap
file 2344 comprises block 2324. The inode file 2346 comprises two 4 KB
blocks 2304 and 2306. The second block 2306 comprises inodes 2306A-2306C
that reference file 2340, file 2342, and blkmap file 2344, respectively.
This is illustrated in block 2306 by listing the file number in the inode.
Fsinfo block 2302 comprises the root inode. The root inode references blocks
2304 and 2306 of inode file 2346. Thus, Figure 17B illustrates a tree of
buffers in a file system rooted by the fsinfo block
2302 containing the root inode.

Figure 17C is a diagram illustrating two modified buffers for blocks 2314
and 2322 in memory. The active file system is modified so that the block
2314 containing data "C" is deleted from file 2340. Also, the data "F"
stored in block 2320 is modified to "F-prime", and is stored in a buffer for
disk block 2322. It should be understood that the modified data contained in
buffers for disk blocks 2314 and 2322 exists only in memory at this time.
All other blocks in the active file system in Figure 17C are not modified,
and therefore have no asterisks beside them. However, some or all of these
blocks may have corresponding clean buffers in memory.

Figure 17D is a diagram illustrating the entries 2324A-2324M of the blkmap
file 2344 in memory. Entries 2324A-2324M are contained in a buffer for 4 KB
block 2324 of blkmap file 2344. As described previously, BIT0 and BIT31 are
the FS-BIT and CP-BIT, respectively. The consistency point bit (CP-BIT) is
set during a consistency point to ensure that the corresponding block is not
modified once a consistency point has begun, but not finished. BIT1 is the
first snapshot bit (described below). Blkmap entries 2324A and 2324B
illustrate that, as shown in Figure 17B, the 4 KB blocks 2304 and 2306 of
inode file 2346 are in the active file system (FS-BIT equal to 1) and in the
consistency point (CP-BIT equal to 1). Similarly, the other blocks 2310-2312
and 2316-2320 and 2324 are in the active file system and in the consistency
point. However, blocks 2308, 2322, and 2326-2328 are neither in the active
file system nor in the consistency point (as indicated by BIT0 and BIT31,
respectively). The entry for deleted block 2314 has a value of 0 in the
FS-BIT, indicating that it has been removed from the active file system.

In step 510 of Figure 5, all "dirty" inodes in the system are marked as
being in the consistency point. Dirty inodes include both inodes that are
dirty and inodes that reference dirty buffers. Figure 17I illustrates a list
of inodes with dirty buffers where consistency point flags 2391 and 2392 of
inodes 2306A and 2306B are set (1). Inode 2306A references block 2314
containing data "C" of file 2340, which is to be deleted from the active
file system. Inode 2306B of block 2306 of inode file 2346 references file
2342. Block 2320 containing data "F" has been modified and a new block
containing data "F-prime" must be allocated. In step 510, the dirty inodes
2306A and 2306B are copied into the buffer for block 2308. The buffer for
block 2306 is subsequently written to disk (in step 530). This is
illustrated in Figure 17E. The modified data exists in memory only, and the
buffer 2308 is marked dirty. The in-consistency-point flags 2391 and 2392 of
inodes 2306A and 2306B are then cleared (0) as illustrated in Figure 17A.
This releases the inodes for use by other processes.

In step 520, regular files are flushed to disk. Thus, block 2322 is
allocated disk space. Block 2314 of file 2340 is to be deleted; therefore
nothing occurs to this block until the consistency point is subsequently
completed. Block 2322 is written to disk in step 520. This is illustrated in
Figure 17F where buffers for blocks 2322 and 2314 have been written to disk
(marked by ***). The intermediate allocation of disk space (**) is not
shown. The inodes 2308A and 2308B of block 2308 of inode file 2346 are
flushed to the inode file. Inode 2308A of block 2308 references blocks 2310
and 2312 of file 2340. Inode 2308B references blocks 2316, 2318, and 2322
for file 2342. As illustrated in Figure 17F, disk space is allocated for
block 2308 of inode file 2346 and for direct block 2322 for file 2342.
However, the file system itself has not been updated. Thus, the file system
remains in a consistent state.

In step 530, the blkmap file 2344 is flushed to disk. This is illustrated
in Figure 17G where the blkmap file 2344 is indicated as being dirty by the
asterisk.

In step 610 of Figure 6, the inode for the blkmap file is pre-flushed to
the inode file as illustrated in Figure 17H. Inode 2308C has been flushed to
block 2308 of inode file 2346. However, inode 2308C still references block
2324. In step 620, disk space is allocated for blkmap file 2344 and inode
file 2346. Block 2308 is allocated for inode file 2346 and block 2326 is
allocated for blkmap file 2344. As described above, block 2308 of inode file
2346 contains a pre-flushed inode 2308C for blkmap file 2344. In step 630,
the inode for the blkmap file 2344 is written to the pre-flushed block 2308C
in inode file 2346. Thus, incore inode 2308C is updated to reference block
2326 in step 620, and is copied into the buffer in memory containing block
2306 that is to be written to block 2308. This is illustrated in Figure 17H
where inode 2308C references block 2326.

In step 640, the entries 2326A-2326L for each block 2304-2326 in the
blkmap file 2344 are updated in Figure 17I. Blocks that have not changed
since the consistency point began in Figure 17B have the same values in
their entries. The entries are updated by copying BIT0 (FS-bit) to the
consistency point bit (BIT31). Block 2306 is not part of the active file
system; therefore BIT0 is equal to zero (BIT0 was turned off in step 620
when block 2308 was allocated to hold the new data for that part of the
inode file). This is illustrated in Figure 17J for entry 2326B. Similarly,
entry 2326F for block 2314 of file 2340 has BIT0 and BIT31 equal to zero.
Block 2320 of file 2342 and block 2324 of blkmap file 2344 are handled
similarly as shown in entries 2326I and 2326K, respectively. In step 650,
dirty block 2308 of inode file 2346 and dirty block 2326 of blkmap file 2344
are written to disk. This is indicated in Figure 17K by a triple asterisk
(***)
beside blocks 2308 and 2326.

Referring to Figure 5, in step 540, the file system information block 2302
is flushed to disk. Thus, fsinfo block 2302 is dirtied and then written to
disk (indicated by a triple asterisk) in Figure 17L. In Figure 17L, a single
fsinfo block 2302 is illustrated. As shown in the diagram, fsinfo block 2302
now references blocks 2304 and 2308 of the inode file 2346. In Figure 17L,
block 2306 is no longer part of the inode file 2346 in the active file
system. Similarly, file 2340 referenced by inode 2308A of inode file 2346
comprises blocks 2310 and 2312. Block 2314 is no longer part of file 2340 in
this consistency point. File 2342 comprises blocks 2316, 2318, and 2322 in
the new consistency point. Block 2308 of inode file 2346 references the new
blkmap file 2344 comprising block 2326.

The active file system is updated by copying the inode of the inode file
2346 into fsinfo block 2302. However, the blocks 2314, 2320, 2324, and 2306
of the previous consistency point remain on disk. These blocks are never
overwritten when updating the file system to ensure that both the old
consistency point 1830 and the new consistency point 1831 exist on disk in
Figure 20 during step 540.

Snapshots

The WAFL system supports snapshots. A snapshot is a read-only copy of an
entire file system at a given instant when the snapshot is created. A newly
created snapshot refers to exactly the same disk blocks as the active file
system does. Therefore, it is created in a small period of time and does not
consume any additional disk space. Only as data blocks in the active file
system are modified and written to new locations on disk does the snapshot
begin to consume extra space. 
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WAFL supports up to 20 different snapshots that are numbered 1 
through 20. Thus, WAFL aUows the Creadon of multiple "dones" of the same 
file system. Each snapshot is represented by a snapshot inode that is similar to 
S the representation of the active file system by a root inode. Snapshots are 
created by dupUcating the root data structure of the file system. In the 
preferr^i embodiment, the root daU structure is the root inode. However, any 
data structure representative of an entire ffle system could be used. The 
snapshot inodes reside in a fixed locfion in the inode file. The limit of 20 
10 snapshot is imposed by the size of the bllcmap entries. WAFL requires two 
steps to create a new snapshot N= copy the root inode into the inode for 
snapshot N; and, copy bit 0 into bit N of each blkmap entry in the blkmap fUe 
Bit 0 indicates the blodcs that are referenced by the tree beneath the root inode. 
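
The two steps just described can be sketched in a few lines. This is only an
illustrative model, not the disclosed implementation: the FileSystem class,
the create_snapshot function, and the use of a dictionary for an inode and a
plain integer for each blkmap entry are all assumptions made for the example.

```python
from dataclasses import dataclass, field

MAX_SNAPSHOTS = 20   # limit stated in the text
FS_BIT = 0           # bit 0: block is referenced by the active file system


@dataclass
class FileSystem:
    """Toy model: a root inode, snapshot inode slots, one integer per blkmap entry."""
    root_inode: dict
    blkmap: list                                   # blkmap[i] is the bit field for disk block i
    snapshot_inodes: dict = field(default_factory=dict)


def create_snapshot(fs: FileSystem, n: int) -> None:
    """Create snapshot N using the two steps described in the text."""
    if not 1 <= n <= MAX_SNAPSHOTS:
        raise ValueError("snapshot number must be in 1..20")
    # Step 1: copy the root inode into the inode for snapshot N.
    fs.snapshot_inodes[n] = dict(fs.root_inode)
    # Step 2: copy bit 0 into bit N of every blkmap entry.
    for i, entry in enumerate(fs.blkmap):
        fs_bit = (entry >> FS_BIT) & 1
        fs.blkmap[i] = (entry & ~(1 << n)) | (fs_bit << n)
```

No data block is read or copied in this sketch; only the root inode and the
blkmap entries are touched, which is why a new snapshot initially consumes no
additional data space.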



The result is a new file system tree rooted by snapshot inode N that
references exactly the same disk blocks as the root inode. Setting a
corresponding bit in the blkmap for each block in the snapshot prevents
snapshot blocks from being freed even if the active file no longer uses the
snapshot blocks. Because WAFL always writes new data to unused disk
locations, the snapshot tree does not change even though the active file
system changes. Because a newly created snapshot tree references exactly the
same blocks as the root inode, it consumes no additional disk space. Over
time, the snapshot references disk blocks that would otherwise have been
freed. Thus, over time the snapshot and the active file system share fewer
and fewer blocks, and the space consumed by the snapshot increases. Snapshots
can be deleted when they consume unacceptable numbers of disk blocks.

The list of active snapshots along with the names of the snapshots is stored
in a metadata file called the snapshot directory. The disk state is updated
as described above. As with all other changes, the update occurs by
automatically advancing from one consistency point to another. Modified
blocks are written to unused locations on the disk after which a new inode
describing the updated file system is written.



Figure 18A is a diagram of the file system 1830, before a snapshot is taken,
where levels of indirection have been removed to provide a simpler overview
of the WAFL file system. The file system 1830 represents the file system 1690
of Figure 16. The file system 1830 is comprised of blocks 1812-1820. The
inode of the inode file is contained in fsinfo block 1810. While a single
copy of the fsinfo block 1810 is shown in Figure 18A, it should be understood
that a second copy of the fsinfo block exists on disk. The inode 1810A
contained in the fsinfo block 1810 comprises 16 pointers that point to 16
blocks having the same level of indirection. The blocks 1812-1820 in Figure
18A represent all blocks in the file system 1830 including direct blocks,
indirect blocks, etc. Though only five blocks 1812-1820 are shown, each block
may point to other blocks.

Figure 18B is a diagram illustrating the creation of a snapshot. The snapshot
is made for the entire file system 1830 by simply copying the inode 1810A of
the inode file that is stored in fsinfo block 1810 into the snapshot inode
1822. By copying the inode 1810A of the inode file, a new file of inodes is
created representing the same file system as the active file system. Because
the inode 1810A of the inode file itself is copied, no other blocks 1812-1820
need to be duplicated. The copied inode, or snapshot inode 1822, is then
copied into the inode file, which dirties a block in the inode file. For an
inode file comprised of one or more levels of indirection, each indirect
block is in turn dirtied. This process of dirtying blocks propagates through
all the levels of indirection. Each 4 KB block in the inode file on disk
contains 32 inodes where each inode is 128 bytes long.
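
The block and inode sizes above fix the arithmetic for finding an inode
inside the inode file. The sketch below assumes, purely for illustration,
that inodes are numbered from zero and packed contiguously; the function name
locate_inode is hypothetical and does not come from the specification.

```python
BLOCK_SIZE = 4096                              # 4 KB inode-file block, per the text
INODE_SIZE = 128                               # bytes per inode, per the text
INODES_PER_BLOCK = BLOCK_SIZE // INODE_SIZE    # 32, matching the text


def locate_inode(inode_number: int) -> tuple[int, int]:
    """Return (inode-file block index, byte offset within that block)."""
    if inode_number < 0:
        raise ValueError("inode number must be non-negative")
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return block_index, slot * INODE_SIZE
```

Under this assumption, writing one snapshot inode dirties exactly one 4 KB
block of the inode file, plus any indirect blocks above it.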



The new snapshot inode 1822 of Figure 18B points back to the highest level of
indirection blocks 1812-1820 referenced by the inode 1810A of the inode file
when the snapshot 1822 was taken. The inode file itself is a recursive
structure because it contains snapshots of the file system 1830. Each
snapshot 1822 is a copy of the inode 1810A of the inode file that is copied
into the inode file.

Figure 18C is a diagram illustrating the active file system 1830 and a
snapshot 1822 when a change to the active file system 1830 subsequently
occurs after the snapshot 1822 is taken. As illustrated in the diagram, block
1818 comprising data "D" is modified after the snapshot was taken (in Figure
18B), and therefore a new block 1824 containing data "Dprime" is allocated
for the active file system 1830. Thus, the active file system 1830 comprises
blocks 1812-1816 and 1820-1824 but does not contain block 1818 containing
data "D". However, block 1818 containing data "D" is not overwritten because
the WAFL system does not overwrite blocks on disk. The block 1818 is
protected against being overwritten by a snapshot bit that is set in the
blkmap entry for block 1818. Therefore, the snapshot 1822 still points to the
unmodified block 1818 as well as blocks 1812-1816 and 1820. The present
invention, as illustrated in Figures 18A-18C, is unlike prior art systems
that create "clones" of a file system where a clone is a copy of all the
blocks of an inode file on disk. Thus, the entire contents of the prior art
inode files are duplicated, requiring large amounts (MB) of disk space as
well as requiring substantial time for disk I/O operations.

As the active file system 1830 is modified in Figure 18C, it uses more disk
space because the file system comprising blocks 1812-1820 is not overwritten.
In Figure 18C, block 1818 is illustrated as a direct block. However, in an
actual file system, block 1818 may be pointed to by indirect blocks as well.
Thus, when block 1818 is modified and stored in a new disk location as block
1824, the corresponding direct and indirect blocks are also copied and
assigned to the active file system 1830.

Figure 19 is a diagram illustrating the changes occurring in block 1824 of
Figure 18C. Block 1824 of Figure 18C is represented within dotted line 1824
in Figure 19. Figure 19 illustrates several levels of indirection for block
1824 of Figure 18C. The new block 1910 that is written to disk in Figure 18C
is labeled 1910 in Figure 19. Because block 1824 comprises a data block 1910
containing modified data that is referenced by double indirection, two other
blocks 1918 and 1926 are also modified. The pointer 1924 of single-indirect
block 1918 references new block 1910, therefore block 1918 must also be
written to disk in a new location. Similarly, pointer 1928 of indirect block
1926 is modified because it points to block 1918. Therefore, as shown in
Figure 19, modifying a data block 1910 can cause several indirect blocks 1918
and 1926 to be modified as well. This requires blocks 1918 and 1926 to be
written to disk in a new location as well.
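
The propagation described for Figure 19 resembles path copying: every block
on the path from the modified data block up to the top indirect block is
rewritten at an unused location. The following is a minimal sketch under
stated assumptions, not the disclosed implementation. The dictionary-based
disk, the function name write_anywhere, and the idea of passing the path
explicitly are all hypothetical.

```python
def write_anywhere(disk: dict, path: list, new_data, next_free: int):
    """Rewrite the data block at the end of `path` without overwriting anything.

    disk: block number -> contents; an indirect block is a list of child
          block numbers, a data block is any other value.
    path: block numbers from the top indirect block down to the data block.
    Returns (new block number for the top of the path, next unused block number).
    """
    child_old, child_new = path[-1], next_free
    disk[child_new] = new_data                 # modified data goes to an unused block
    next_free += 1
    for blk in reversed(path[:-1]):            # walk up through the indirect blocks
        pointers = list(disk[blk])             # copy the indirect block
        pointers[pointers.index(child_old)] = child_new
        disk[next_free] = pointers             # the copy is also written to a new location
        child_old, child_new = blk, next_free
        next_free += 1
    return child_new, next_free
```

The old blocks remain in the dictionary untouched, which is what lets a
snapshot continue to reference them.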

Because the direct and indirect blocks 1910, 1918 and 1926 of data block 1824
of Figure 18C have changed and been written to a new location, the inode in
the inode file is written to a new block. The modified block of the inode
file is allocated a new block on disk since data cannot be overwritten.

As shown in Figure 19, block 1910 is pointed to by indirect blocks 1926 and
1918, respectively. Thus when block 1910 is modified and stored in a new disk
location, the corresponding direct and indirect blocks are also copied and
assigned to the active file system. Thus, a number of data structures must be
updated. Changing direct block 1910 and indirection blocks 1918 and 1926
causes the blkmap file to be modified.

The key data structures for snapshots are the blkmap entries where each entry
has multiple bits for a snapshot. This enables a plurality of snapshots to be
created. A snapshot is a picture of a tree of blocks that is the file system
(1830 of Figure 18). As long as new data is not written onto blocks of the
snapshot, the file system represented by the snapshot is not changed. A
snapshot is similar to a consistency point.



The file system of the present invention is completely consistent as of the
last time the fsinfo blocks 1810 and 1870 were written. Therefore, if power
is interrupted to the system, upon restart the file system 1830 comes up in a
consistent state. Because 8-32 MB of disk space are used in a typical prior
art "clone" of a 1 GB file system, clones are not conducive to consistency
points or snapshots as is the present invention.

Referring to Figure 22, two previous snapshots 2110A and 2110B exist on disk.
At the instant when a third snapshot is created, the root inode pointing to
the active file system is copied into the inode entry 2110C for the third
snapshot in the inode file 2110. At the same time in the consistency point
that goes through, a flag indicates that snapshot 3 has been created. The
entire file system is processed by checking if BIT0 for each entry in the
blkmap file is set (1) or cleared (0). All the BIT0 values for each blkmap
entry are copied into the plane for snapshot three. When completed, every
active block 2110-2116 and 1207 in the file system is in the snapshot at the
instant it is taken.

Blocks that have existed on disk continuously for a given length of time are
also present in corresponding snapshots 2110A-2110B preceding the third
snapshot 2110C. If a block has been in the file system for a long enough
period of time, it is present in all the snapshots. Block 1207 is such a
block. As shown in Figure 22, block 1207 is referenced by inode 2210G of the
active inode file and indirectly by snapshots 1, 2 and 3.



The sequential order of snapshots does not necessarily represent a
chronological sequence of file system copies. Each individual snapshot in a
file system can be deleted at any given time, thereby making an entry
available for subsequent use. When BIT0 of a blkmap entry that references the
active file system is cleared (indicating the block has been deleted from the
active file system), the block cannot be reused if any of the snapshot
reference bits are set. This is because the block is part of a snapshot that
is still in use. A block can only be reused when all the bits in the blkmap
entry are set to zero.
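
The reuse rule reduces to a test on the whole blkmap entry. A minimal sketch,
assuming each entry is held as an integer bit field; the helper names
is_reusable and delete_from_active are hypothetical:

```python
FS_BIT_MASK = 1 << 0   # BIT0: block belongs to the active file system


def delete_from_active(entry: int) -> int:
    """Clear BIT0 when the active file system stops using the block."""
    return entry & ~FS_BIT_MASK


def is_reusable(entry: int) -> bool:
    """A block may be reallocated only when every bit in its entry is zero."""
    return entry == 0
```

For example, an entry with BIT0 and the bit for snapshot 2 set is still
non-zero after delete_from_active, so the block stays allocated until that
snapshot is deleted.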

Algorithm for Generating a Snapshot



Creating a snapshot is almost exactly like creating a regular consistency
point as shown in Figure 5. In step 510, all dirty inodes are marked as being
in the consistency point. In step 520, all regular files are flushed to disk.
In step 530, special files (i.e., the inode file and the blkmap file) are
flushed to disk. In step 540, the fsinfo blocks are flushed to disk. In step
550, all inodes that were not in the consistency point are processed. Figure
5 is described above in detail. In fact, creating a snapshot is done as part
of creating a consistency point. The primary difference between creating a
snapshot and a consistency point is that all entries of the blkmap file have
the active FS-bit copied into the snapshot bit. The snapshot bit represents
the corresponding snapshot in order to protect the blocks in the snapshot
from being overwritten. The creation and deletion of snapshots is performed
in step 530 because that is the only point where the file system is
completely self-consistent and about to go to disk.



Different steps are performed in step 530 than illustrated in Figure 6 for a
consistency point when a new snapshot is created. The steps are very similar
to those for a regular consistency point. Figure 7 is a flow diagram
illustrating the steps that step 530 comprises for creating a snapshot. As
described above, step 530 allocates disk space for the blkmap file and the
inode file and copies the active FS-bit into the snapshot bit that represents
the corresponding snapshot in order to protect the blocks in the snapshot
from being overwritten.

In step 710, the inodes of the blkmap file and the snapshot being created are
pre-flushed to disk. In addition to flushing the inode of the blkmap file to
a block of the inode file (as in step 610 of Figure 6 for a consistency
point), the inode of the snapshot being created is also flushed to a block of
the inode file. This ensures that the block of the inode file containing the
inode of the snapshot is dirty.

In step 720, every block in the blkmap file is dirtied. In step 760
(described below), all entries in the blkmap file are updated instead of just
the entries in dirty blocks. Thus, all blocks of the blkmap file must be
marked dirty here to ensure that step 730 write-allocates disk space for
them.

In step 730, disk space is allocated for all dirty blocks in the inode and
blkmap files. The dirty blocks include the block in the inode file containing
the inode of the blkmap file, which is dirty, and the block containing the
inode for the new snapshot.

In step 740, the contents of the root inode for the file system are copied
into the inode of the snapshot in the inode file. At this time, every block
that is part of the new consistency point and that will be written to disk
has disk space allocated for it. Thus, duplicating the root inode in the
snapshot inode effectively copies the entire active file system. The actual
blocks that will be in the snapshot are the same blocks of the active file
system.

In step 750, the inodes of the blkmap file and the snapshot are copied into
the inode file.

In step 760, entries in the blkmap file are updated. In addition to copying
the active FS-bit to the CP-bit for the entries, the active FS-bit is also
copied to the snapshot bit corresponding to the new snapshot.

In step 770, all dirty blocks in the blkmap and inode files are written to
disk.
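
Step 760 is the point at which a snapshot differs from a plain consistency
point, so a small sketch of that update may help. It is an illustration only:
entries are modeled as integers, the helper names are hypothetical, and the
CP-bit position (bit 31) is an assumption made for the example.

```python
FS_BIT = 0     # active file system bit
CP_BIT = 31    # consistency point bit; this position is assumed for the sketch


def copy_bit(entry: int, src: int, dst: int) -> int:
    """Return `entry` with bit `dst` set to the current value of bit `src`."""
    bit = (entry >> src) & 1
    return (entry & ~(1 << dst)) | (bit << dst)


def update_blkmap_for_snapshot(blkmap: list, snapshot: int) -> None:
    """Step 760 for a snapshot: visit every entry, not only those in dirty blocks."""
    for i, entry in enumerate(blkmap):
        entry = copy_bit(entry, FS_BIT, CP_BIT)        # as for any consistency point
        blkmap[i] = copy_bit(entry, FS_BIT, snapshot)  # extra work for the new snapshot
```

Because every entry is rewritten, every block of the blkmap file must already
have disk space allocated, which is why step 720 dirties the whole file.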



Finally, at some time, snapshots themselves are removed from the file system
in step 760. A snapshot is removed from the file system by clearing its
snapshot inode entry in the inode file of the active file system and clearing
each bit corresponding to the snapshot number in every entry in the blkmap
file. A count is performed also of each bit for the snapshot in all the
blkmap entries that are cleared from a set value, thereby providing a count
of the blocks that are freed (corresponding amount of disk space that is
freed) by deleting the snapshot. The system decides which snapshot to delete
on the basis of the oldest snapshots. Users can also choose to delete
specified snapshots manually.
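
Deletion as described above can be sketched as follows. The function name
delete_snapshot, the dictionary of snapshot inodes, and the integer blkmap
entries are assumptions made for illustration, not part of the specification.

```python
def delete_snapshot(snapshot_inodes: dict, blkmap: list, n: int) -> int:
    """Remove snapshot N and return how many blkmap bits went from set to clear."""
    snapshot_inodes.pop(n, None)          # clear the snapshot inode entry
    mask = 1 << n
    cleared = 0
    for i, entry in enumerate(blkmap):
        if entry & mask:                  # bit N was set for this block
            cleared += 1
            blkmap[i] = entry & ~mask     # clear bit N only; other bits are kept
    return cleared
```

The returned value is the count the text describes. A block whose entry is
still non-zero afterwards remains in use by the active file system or by
another snapshot.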

The present invention limits the total number of snapshots and keeps a blkmap
file that has entries with multiple bits for tracking the snapshots instead
of using pointers having a COW bit as in Episode. An unused block has all
zeroes for the bits in its blkmap file entry. Over time, the BIT0 for the
active file system is usually turned on at some instant. Setting BIT0
identifies the corresponding block as allocated in the active file system. As
indicated above, all snapshot bits are initially set to zero. If the active
file bit is cleared before any snapshot bits are set, the block is not
present in any snapshot stored on disk. Therefore, the block is immediately
available for reallocation and cannot be recovered subsequently from a
snapshot.



Generation of a Snapshot

As described previously, a snapshot is very similar to a consistency point.
Therefore, generation of a snapshot is described with reference to the
differences between it and the generation of a consistency point shown in
Figures 17A-17L. Figures 21A-21F illustrate the differences for generating a
snapshot.

Figures 17A-17D illustrate the state of the WAFL file system when a snapshot
is begun. All dirty inodes are marked as being in the consistency point in
step 510 and regular files are flushed to disk in step 520. Thus, initial
processing of a snapshot is identical to that for a consistency point.
Processing for a snapshot differs in step 530 from that for a consistency
point. The following describes processing of a snapshot according to Figure
7.

The following description is for a second snapshot of the WAFL file system. A
first snapshot is recorded in the blkmap entries of Figure 17C. As indicated
in entries 2324A-2324M, blocks 2304-2306, 2310-2320, and 2324 are contained
in the first snapshot. All other snapshot bits (BIT2-BIT20) are assumed to
have values of 0 indicating that a corresponding snapshot does not exist on
disk. Figure 21A illustrates the file system after steps 510 and 520 are
completed.



In step 710, inodes 2308C and 2308D of snapshot 2 and blkmap file 2344 are
pre-flushed to disk. This ensures that the block of the inode file that is
going to contain the snapshot 2 inode is dirty. In Figure 21B, inodes 2308C
and 2308D are pre-flushed for snapshot 2 and for blkmap file 2344.

In step 720, the entire blkmap file 2344 is dirtied. This will cause the
entire blkmap file 2344 to be allocated disk space in step 730. In step 730,
disk space is allocated for dirty blocks 2308 and 2326 for inode file 2346
and blkmap file 2344 as shown in Figure 21C. This is indicated by a triple
asterisk (***) beside blocks 2308 and 2326. This is different from generating
a consistency point where disk space is allocated only for blocks having
entries that have changed in the blkmap file 2344 in step 620 of Figure 6.
Blkmap file 2344 of Figure 21C comprises a single block 2324. However, when
blkmap file 2344 comprises more than one block, disk space is allocated for
all the blocks in step 730.



In step 740, the root inode for the new file system is copied into inode
2308D for snapshot 2. In step 750, the inodes 2308C and 2308D of blkmap file
2344 and snapshot 2 are flushed to disk as illustrated in Figure 21D. The
diagram illustrates that snapshot 2 inode 2308D references blocks 2304 and
2308 but not block 2306.

In step 760, entries 2326A-2326L in block 2326 of the blkmap file 2344 are
updated as illustrated in Figure 21E. The diagram illustrates that the
snapshot 2 bit (BIT2) is updated as well as the FS-BIT and CP-BIT for each
entry 2326A-2326L. Thus, blocks 2304, 2308-2312, 2316-2318, 2322, and 2326
are contained in snapshot 2 whereas blocks 2306, 2314, 2320, and 2324 are
not. In step 770, the dirty blocks 2308 and 2326 are written to disk.

Further processing of snapshot 2 is identical to that for generation of a
consistency point illustrated in Figure 5. In step 540, the two fsinfo blocks
are flushed to disk. Thus, Figure 21F represents the WAFL file system in a
consistent state after this step. Files 2340, 2342, 2344, and 2346 of the
consistent file system, after step 540 is completed, are indicated within
dotted lines in Figure 21F. In step 550, the consistency point is completed
by processing inodes that were not in the consistency point.

Access Time Overwrites

Unix file systems must maintain an "access time" (atime) [...]. Atime
indicates the last time that the file was read. It is updated every time the
file is accessed. Consequently, when a file is read, the block that contains
the inode in the inode file is rewritten to update the inode. This is
disadvantageous for creating snapshots because, as a consequence, reading a
file could potentially use up disk space. Further, reading all the files in a
file system could cause the entire inode file to be duplicated. The present
invention solves this problem.

Because of atime, a read could potentially consume disk space since modifying
an inode causes a new block for the inode file to be written on disk.
Further, a read operation could potentially fail if a file system is full,
which is an abnormal condition for a file system to have occur.

In general, data on disk is not overwritten in the WAFL file system so as to
protect data stored on disk. The only exception to this rule is atime
overwrites for an inode as illustrated in Figures 23A-23B. When an "atime
overwrites" occurs, the only data that is modified in a block of the inode
file is the atime of one or more of the inodes it contains and the block is
rewritten in the same location. This is the only exception in the WAFL
system; otherwise new data is always written to new disk locations.

In Figure 23A, the atimes 2423 and 2433 of an inode 2422 in [...] WAFL inode
file block 2420 and the snapshot inode 2432 [...] 2420 are illustrated. Inode
2422 of block 2420 references direct block 2410. The atime 2423 of inode 2422
is "4/30 9:15 PM" whereas the atime 2433 of snapshot inode 2432 is "5/1 10:00
AM". Figure 23A illustrates the file system before direct buffer 2410 is
accessed.

Figure 23B illustrates the inode 2422 of direct block 2410 after direct block
2410 has been accessed. As shown in the diagram, the access time 2423 of
inode 2422 is overwritten with the access time 2433 of snapshot 2432 that
references it. Thus, the access time 2423 of inode 2422 for direct block 2410
is "5/1 11:23 AM".
Allowing inode file blocks to be overwritten with new atimes produces a
slight inconsistency in the snapshot. The atime of a file in a snapshot can
actually be later than the time that the snapshot was created. In order to
prevent users from detecting this inconsistency, WAFL adjusts the atime of
all files in a snapshot to the time when the snapshot was actually created
instead of the time a file was last accessed. This snapshot time is stored in
the inode that describes the snapshot as a whole. Thus, when accessed via the
snapshot, the access time 2423 for inode 2422 is always reported as "5/1
10:00 AM". This occurs both before the update when it may be expected to be
"4/30 9:15 PM", and after the update when it may be expected to be "5/1 11:23
AM". When accessed through the active file system, the times are reported as
"4/30 9:15 PM" and "5/1 11:23 AM" before and after the update, respectively.

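The atime adjustment for snapshots amounts to choosing which timestamp to
report. A minimal sketch follows; the function name reported_atime and its
parameters are hypothetical, and the year used in the usage note is arbitrary
because the text gives only month, day, and time.

```python
from datetime import datetime
from typing import Optional


def reported_atime(inode_atime: datetime,
                   snapshot_time: Optional[datetime] = None) -> datetime:
    """Return the atime a user sees for a file.

    snapshot_time is None when the file is reached through the active file
    system; otherwise it is the creation time stored in the snapshot's inode.
    """
    return inode_atime if snapshot_time is None else snapshot_time
```

With a snapshot time of datetime(2000, 5, 1, 10, 0), the function returns
that value whether the stored atime is 4/30 9:15 PM or 5/1 11:23 AM, matching
the behavior described for inode 2422.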
In this manner, a method is disclosed for maintaining a file system in a
consistent state and for creating read-only copies of the file system.

CLAIMS OF THE INVENTION



1. A method for generating a consistency point comprising the steps of:

marking a plurality of inodes pointing to a plurality of modified blocks in a
file system as being in a consistency point;
flushing regular files to storage means;
flushing special files to said storage means;
flushing at least one block of file system information to said storage means;
and,
requeueing any dirty inodes that were not part of said consistency point.

2. The method of claim 1 wherein said step of flushing said special files to
said storage means further comprises the steps of:

pre-flushing an inode for a blockmap file to an inode file;
allocating space on said storage means for all dirty blocks in said inode and
said blockmap files;
flushing said inode for said blockmap file again;
updating a plurality of entries in said blockmap file wherein each entry of
said plurality of entries represents a block on said storage means; and,
writing all dirty blocks in said blockmap file and said inode file to said
storage means.